The original data are stored in the file
lxb-wz2c-m46p (1).xlsx. After excluding the first 100 rows,
a random sample of 100 entries was selected from the remaining data with
seed 2025. Each text comment was then read and summarized manually, with
human judgment serving as the ground truth, and recorded as 13 features
Document.ID, Tone, Tone_justify,
Tone_quote, Commenter_Role, Specialty,
Patient_cost, Patient_quality, Provider_pay,
Provider_quality, Other, Justification,
Quote, Role_category in a CSV file named
fill_sample.csv. It contains 13 discrete variables and 100
rows, totaling 1,300 cells. Among them, 18.1% of the cells are
empty, meaning no useful information could be extracted.
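The sampling step described above can be sketched in base R. This is a minimal illustration on a made-up data frame, not the exact call used for this report; `full_data` and `row_id` are hypothetical names.

```r
# Toy stand-in for the spreadsheet: 500 rows with a hypothetical id column.
full_data <- data.frame(row_id = seq_len(500))

# Drop the first 100 rows, then draw 100 of the remaining rows
# without replacement, using the seed stated in the text.
remaining <- full_data[-(1:100), , drop = FALSE]
set.seed(2025)
sampled <- remaining[sample(nrow(remaining), 100), , drop = FALSE]

nrow(sampled)               # 100
all(sampled$row_id > 100)   # TRUE: excluded rows cannot be drawn
```

Setting the seed immediately before `sample()` is what makes the draw reproducible; any random call placed between the two lines would change which rows are selected.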
Column-wise Missingness:
The Commenter_Role column has a missing rate of 40%, suggesting
limited role identity information.
The Specialty column, which reflects the sub-identity within the
commenter’s role, is missing in 80% of entries.
The Other column, which captures other extractable information
from comments, has a 74% missing rate.
I recommend paying close attention to the Specialty and
Other columns, as their high missingness may affect the
robustness of downstream analyses, depending on the
study context.
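Column-wise and overall missing rates of this kind can be computed as below. The data frame is a made-up miniature, not fill_sample.csv, and the sketch assumes empty cells were read in as NA (for example with `na.strings = c("", "NA")` in `read.csv`).

```r
# Hypothetical miniature of the sample; the values are invented.
toy <- data.frame(
  Commenter_Role = c("nurse", NA, "patient", NA, "physician"),
  Specialty      = c(NA, NA, NA, NA, "emergency"),
  Other          = c(NA, "note", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of empty cells in each column.
col_missing <- round(100 * colMeans(is.na(toy)), 1)
col_missing

# Percentage of empty cells in the whole table.
overall_missing <- round(100 * mean(is.na(toy)), 1)
overall_missing
```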
Row-wise Missingness:
The 4 rows with the most missing values share one common trait: their
tone is classified as “very negative”.
The 4 rows with complete information (Rows 15, 37, 69, and 83) share one
common trait: their roles tend to be professional rather than
non-professional roles (e.g. patient), suggesting that professional identity may
correlate with more complete information.
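A row-wise count of empty cells is one way to find the most and least complete rows. The sketch below uses invented values and again assumes empty cells are stored as NA.

```r
# Hypothetical rows; the values are invented for illustration.
toy <- data.frame(
  Tone           = c("very_negative", "negative", "very_negative", "neutral"),
  Commenter_Role = c(NA, "physician", NA, "patient"),
  Specialty      = c(NA, "emergency", NA, NA),
  Other          = c(NA, "note", "note", NA),
  stringsAsFactors = FALSE
)

# Number of empty cells in each row.
toy$n_missing <- rowSums(is.na(toy))

# Rows with no empty cells, and rows with the most empty cells.
complete_rows <- which(toy$n_missing == 0)
most_missing  <- which(toy$n_missing == max(toy$n_missing))

toy[most_missing, c("Tone", "n_missing")]
```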
Description:
The tone of a comment is categorized as
very_negative, negative, neutral, or
positive. Tone_justify provides the reasoning
behind this classification based on linguistic metrics. These include
negative_word (presence of words with a negative connotation,
such as not or destroy), strong_word (forceful
terms such as must or stop), imperative (use
of imperative sentences), capitalization (use of
capitalization for emphasis), punctuation (presence of
punctuation marks such as ! or ?),
implicature (presence of indirect expression or implied
negativity), objective (whether the comment is fact-based),
supportive (whether the comment expresses encouragement or
approval), and high_frequency (how often these
features occur within the comment, measured either by the proportion
of sentences containing such features or by repeated use of the same
indicator).
Tone Distribution:
The “Percentage of Each Category in Tone” pie chart shows that negative and very
negative tones have roughly equal proportions (43% and 45%,
respectively). In contrast, neutral and positive tones are much less
common (7% neutral, 1% positive). This indicates that the majority of
commenters in the sample express a negative attitude toward private equity (PE).
Correlation between Tone and Tone_justify:
According to the chi-square test (\(p<0.05\)), Tone_justify is statistically
significantly associated with Tone, which suggests that the Tone_justify metrics
we defined are reasonable. The heatmap, based on the scaled contingency table, shows the
proportion of each Tone_justify metric within each Tone. It can be read
vertically and horizontally, in combination with the pie charts.
Vertically:
In the negative tone category, the
negative_word metric is the most dominant, while
capitalization, implicature,
strong_word, objective, and
supportive appear least frequently.
In the very negative tone, punctuation and
capitalization contribute the most.
In neutral and positive tones, only
objective and supportive appear,
respectively, as the sole contributing metrics.
Horizontally:
In negative_word, negative tone contributed
most, followed by very negative.
In strong_word, high_frequency,
capitalization, and implicature,
very negative tone contributed most.
In imperative, negative tone contributes
most.
In punctuation, negative tone contributes
most.
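The vertical and horizontal readings correspond to scaling the same contingency table by column and by row. The sketch below uses invented data and assumes one row per (comment, metric) pair with tones on the columns; the actual layout behind the heatmap is not shown in this report.

```r
# Made-up long-format data: one row per (comment, metric) pair.
pairs <- data.frame(
  Tone = c("negative", "negative", "negative", "very_negative",
           "very_negative", "very_negative", "neutral", "positive"),
  Tone_justify = c("negative_word", "negative_word", "imperative",
                   "capitalization", "punctuation", "negative_word",
                   "objective", "supportive")
)

counts <- table(pairs$Tone_justify, pairs$Tone)

# Vertical reading: share of each metric within a tone (columns sum to 1).
within_tone <- prop.table(counts, margin = 2)

# Horizontal reading: share of each tone within a metric (rows sum to 1).
within_metric <- prop.table(counts, margin = 1)

# Chi-square test of independence. With counts this small R warns that the
# approximation may be poor; the warning is silenced only for this toy case.
test <- suppressWarnings(chisq.test(counts))
test$p.value
```

With sparse tables like the real one, `chisq.test(counts, simulate.p.value = TRUE)` is one option for a p-value that does not rely on the large-sample approximation.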
Description
Based on whether the commenter role is directly affected by PE, I
categorized roles into three groups: "patient",
"provider", and "other". For example,
pharmacists and providers are grouped together under the provider
category, while organization and attorney roles are grouped under
other. The “patient” and “provider” groups are directly affected by PE
through their involvement in the healthcare delivery chain. In contrast,
the “other” group is indirectly affected by PE via institutional, legal,
or organizational influences.
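A grouping like this can be expressed as a lookup vector. Only the pharmacist, provider, organization, and attorney entries below are stated in the text; the patient entry is an assumption added for illustration, and the full mapping used for Role_category is not reproduced here.

```r
# Lookup from Commenter_Role to general role group.
# The "patient" entry is an assumption; the other four follow the text.
role_map <- c(
  pharmacist   = "provider",
  provider     = "provider",
  organization = "other",
  attorney     = "other",
  patient      = "patient"
)

# Example roles, including a missing one.
roles <- c("pharmacist", "attorney", NA, "patient", "organization")

# Missing roles stay NA after the lookup.
role_category <- unname(role_map[roles])
role_category
```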
Role and corresponding specialties
The table “Commenter Role with Specialties” summarizes
all role names along with their corresponding specialty names. In the
specialties column, missing or unavailable information is represented by
a “-”. For example, the role nurse in our data corresponds to
specialties including both “-” and “register,” where “-” (NA) indicates
that the specialty information could not be directly extracted or
inferred from the text.
Distribution of General Role
The "Distribution of General Role" pie chart shows the
proportion of the three groups: patient, provider, and other. The null
category represents missing or unavailable role information.
Distribution of Commenter_Role within the most common
general role groups
The "Distribution of Commenter Roles in General Role: xxxx" pie
charts display the distribution of Commenter_Role within the most common
general role group.
Distribution of Specialty Within the Most Common
Commenter Role in the Most Common General Role Category
The "Top xxxx Commenter Roles: Distribution of Specialty" bar
charts illustrate how specialties are distributed within each of the top n
Commenter_Role values in the most common general role category.

Commenter Role with Specialties

| Commenter_Role | Specialties |
|---|---|
| anesthesiologist | - |
| attorney | - |
| caregiver | - |
| cilvil_servant | - |
| nurse | register, - |
| organization | ngo, coalition, labor_union |
| patient | -, millitary veteran, kidney transplant recipient |
| pharmacist | -, retired |
| physician | emergency, rheumatology, -, terminated |
| provider | -, emergency |
| psychotherapist | emergency, -, terminated |
[1] “physician” “nurse” “provider” “pharmacist”
[5] “psychotherapist”
The following comments lack clear identifying information, making it
difficult to classify the commenter roles. As a result, most of them are labeled
as NA in Commenter_Role.
FTC-2024-0022-1916: Cannot determine the
commenter role; role inference is difficult
Comment:
> As someone who lived in Europe for 5 years, I can say for certain
that the US has the best health care and the most highly trained doctors
in the world. I moved back to the US specifically because I needed
better healthcare than I could find anywhere in Europe. Why would we
want to degrade that so that a tiny minority of people can reap huge
profits?? I am glad to see the government recognize the problems
presented by consolidation in health care industries, particularly
emergency rooms. Private equity firms often prioritize profits over
patient care and worker safety, leading to higher costs, reduced quality
of service, and increased risks for both patients and healthcare
workers. I urge you to consider implementing policies that would limit
or restrict the ability of private equity firms to acquire and operate
emergency rooms. This could include stricter regulations on mergers and
acquisitions in the healthcare industry, as well as greater transparency
and accountability requirements for these firms. Thank you.
FTC-2024-0022-0245: Cannot determine the
commenter role at all; text too short
Comment:
> Private equity has no business to be involved in the Medical
Professions.
FTC-2024-0022-1320: Commenter role unclear, but
“MD” appears in the name
Comment:
> It is unthinkable that the insurance company could own the hospital
or practice, thus, employing the doctors that they already stranglehold
with insurance denials. This is the definition of a nefarious
monopoly!
FTC-2024-0022-2091: Commenter role uncertain;
possibly nurse (mentions nursing)
Comment:
> Private Equity firms, large insurance companies have made working
in healthcare extremely unfulfilling with constant pressures to meet
unattainable patient volumes and constant cutting of staff and services
with expectation. Current employees will continue to do more. Nursing is
no longer about taking care of patients, but rather being able to check
enough boxes on a computer screen. Large insurance companies that have
constant denials on medication that are FDA approved for their condition
make it difficult to care for patients in a manner that follows
standards of care.
Chi-square test indicates that Commenter_Role is
statistically significantly associated with the Tone of the
comment (\(p<0.05\)).
The Contingency Table Heatmap shows counts for each combination
of Role ("patient", "provider",
"other", "no information") and Tone category
("negative", "very negative",
"neutral", "positive").
The histogram showing the distribution of tone within each role category reveals the following:
For commenters with no role information, negative and very negative tones account for the largest shares. Combined with the pattern seen in the other groups, namely that those who directly experience PE tend to show negative or very negative tones, this suggests that many commenters without role information may belong to the patient or provider groups, if they had to be classified as patient, provider, or other.
For providers and patients, negative and very negative tones account for more comments than neutral or positive tones. Moreover, compared with patients, providers’ comments fall more often in the negative tone than in the very negative tone. This may be due to providers’ professional training and occupational discipline, which may help them regulate emotional expression when describing issues.
For Other, the positive tone accounts for the smallest share, while the negative, very negative, and neutral tones show no marked imbalance. This suggests that people who do not directly experience PE also tend not to hold supportive attitudes toward PE, which may indicate that PE’s impact is broad and that its societal effect is perceived as primarily negative.
word_count measures the length of the comment text. For comments recorded in attached files, I counted the words manually. For comments recorded in the original data cell, I counted the words with the str_count() function in R.
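The pattern passed to str_count() is not shown in this report, so the following is a base-R stand-in that counts whitespace-separated tokens rather than the exact call used. `count_words` is a hypothetical helper name.

```r
# Count whitespace-separated tokens; NA or blank text counts as 0 words.
count_words <- function(text) {
  text <- trimws(text)
  n <- lengths(strsplit(text, "\\s+"))
  n[is.na(text) | text == ""] <- 0L
  n
}

count_words(c("Private equity has no business here.", "  Thank   you. ", "", NA))
# 6 2 0 0
```

Counting tokens this way treats punctuation attached to a word as part of that word, which may differ slightly from a manual count.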
Overall Summary of Word_count: The “Overall Summary Word Count” table and histogram show the overall distribution of word_count. The distribution clearly has a long right tail, so the subsequent statistical tests should not be ones that assume a normal distribution.
Summary of Word_count by Tone: The Kruskal-Wallis H test (non-parametric) shows that there are significant differences in the distribution of word_count among the Tone groups. According to the “Summary Word Count by Tone” table, very negative comments tend to be shorter than negative comments, but this difference is not statistically significant (adjusted p = 1.000). The difference in word_count between neutral and negative comments is statistically significant (adjusted p = 0.005), as is the difference between neutral and very negative comments (adjusted p = 0.018).
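The adjusted p-values in the “Dunn's Test for Word Count by Tone” table are consistent with a Bonferroni correction over six pairwise comparisons. The check below copies the unadjusted values from that table and applies base R's `p.adjust`; it is a verification of the arithmetic, not a rerun of the test.

```r
# Unadjusted p-values copied from the Dunn's test table for Tone.
p_unadj <- c(
  "negative - neutral"       = 0.000838558,
  "negative - positve"       = 0.081828165,
  "neutral - positve"        = 0.653457281,
  "negative - very_negative" = 0.472096352,
  "neutral - very_negative"  = 0.003057778,
  "positve - very_negative"  = 0.111350954
)

# Bonferroni: multiply by the number of comparisons (6) and cap at 1.
p_adj <- p.adjust(p_unadj, method = "bonferroni")
round(p_adj, 6)

# Comparisons that stay below 0.05 after adjustment.
significant <- names(p_adj)[p_adj < 0.05]
significant
```

Only the two comparisons involving the neutral group against negative and very negative remain significant, which matches the interpretation in the text.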
Summary of Word_count by Role_category: The Kruskal-Wallis H test and the “Summary Word Count by Role_category” table indicate that there is no statistically significant difference in word_count across Role_category groups (p = 0.1606).
Summary of Word_count by concern level: To
explore whether comment length varies across different aspects of
concern, I constructed a new variable, Concern_Level, based
on four binary indicators: Patient_cost,
Patient_quality, Provider_pay, and
Provider_quality. I calculated the row-wise sum of these
four variables for each comment, resulting in a score ranging from 0 to
4, labeled as "No Concern" (0), "Low" (1),
"Medium" (2), "High" (3), and
"Highest" (4), interpreted as a measure of “concern
breadth”. To assess whether comment length differs significantly across
these concern levels, I first computed group-wise descriptive
statistics of word_count (“Summary of Word Count by Concern Level”
table), followed by a Kruskal–Wallis test to detect any overall
differences. Since the Kruskal–Wallis test is significant (\(p<0.05\)), I applied Dunn’s post hoc
pairwise comparisons with Bonferroni correction to identify which
specific levels differed significantly in comment length. The results show that
comments with Low, Medium, and
High levels of concern were all significantly longer
than those with No Concern (adjusted p < 0.05 for all three
comparisons).
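The construction of Concern_Level can be sketched as follows. The rows are invented, and the sketch assumes the four indicators are coded 0/1 with no missing values; the actual coding in fill_sample.csv may differ.

```r
# Made-up rows; the four indicators are assumed to be coded 0/1.
toy <- data.frame(
  Patient_cost     = c(0, 1, 1, 1, 1, 0, 0, 1),
  Patient_quality  = c(0, 0, 1, 1, 1, 0, 1, 1),
  Provider_pay     = c(0, 0, 0, 1, 1, 0, 0, 0),
  Provider_quality = c(0, 0, 0, 0, 1, 0, 0, 0),
  word_count       = c(5, 118, 131, 155, 53, 13, 90, 140)
)

# Row-wise sum of the four indicators gives a score from 0 to 4.
indicators <- c("Patient_cost", "Patient_quality",
                "Provider_pay", "Provider_quality")
score <- rowSums(toy[, indicators])

# Label the score as an ordered factor.
toy$Concern_Level <- factor(
  score,
  levels  = 0:4,
  labels  = c("No Concern", "Low", "Medium", "High", "Highest"),
  ordered = TRUE
)

table(toy$Concern_Level)

# Omnibus test for a difference in word_count across levels.
kw <- kruskal.test(word_count ~ Concern_Level, data = toy)
kw$p.value
```

Dunn's test is not part of base R; add-on packages such as FSA or dunn.test provide it, and which one was used here is not stated.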

Overall Summary Word Count

| count | missing | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| 100 | 0 | 1 | 46.25 | 125.5 | 333.41 | 226.25 | 6297 | 806.8054 |

Summary Word Count by Tone

| Tone | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| negative | 44 | 1 | 77.25 | 142 | 387.7727 | 271.75 | 6297 | 1002.44962 |
| neutral | 8 | 3 | 3.00 | 12 | 37.8750 | 21.25 | 227 | 76.89592 |
| positve | 1 | 3 | 3.00 | 3 | 3.0000 | 3.00 | 3 | NA |
| very_negative | 47 | 3 | 68.00 | 130 | 339.8511 | 173.50 | 3724 | 663.49463 |

Kruskal-Wallis Test for Word Count by Tone

| Statistic | Df | P_value |
|---|---|---|
| 13.693 | 3 | 0.003354 |

Dunn's Test for Word Count by Tone

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| negative - neutral | 3.3397435 | 0.000838558 | 0.005031 |
| negative - positve | 1.7401757 | 0.081828165 | 0.490969 |
| neutral - positve | 0.4489645 | 0.653457281 | 1.000000 |
| negative - very_negative | 0.7190723 | 0.472096352 | 1.000000 |
| neutral - very_negative | -2.9618691 | 0.003057778 | 0.018347 |
| positve - very_negative | -1.5921502 | 0.111350954 | 0.668106 |

Summary Word Count by Role_category

| Role_category | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| other | 12 | 1 | 3.0 | 62.0 | 257.6667 | 238.00 | 1741 | 497.5126 |
| patient | 20 | 3 | 138.5 | 178.5 | 258.7000 | 264.25 | 1199 | 262.9307 |
| provider | 29 | 3 | 84.0 | 126.0 | 307.7931 | 294.00 | 2720 | 516.0690 |
| NA | 39 | 3 | 31.5 | 78.0 | 414.0769 | 132.50 | 6297 | 1175.3598 |

Kruskal-Wallis Test for Word Count by Role_category

| Statistic | Df | P_value |
|---|---|---|
| 3.658 | 2 | 0.1606 |

Dunn's Test for Word Count by Role_category

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| other - patient | -1.9095797 | 0.05618735 | 0.1686 |
| other - provider | -1.1887238 | 0.23454838 | 0.7036 |
| patient - provider | 0.9951801 | 0.31964867 | 0.9589 |

Summary of Word Count by Concern Level

| Concern_Level | count | min | q1 | median | mean | q3 | max | sd |
|---|---|---|---|---|---|---|---|---|
| No Concern | 9 | 3 | 5.00 | 13.0 | 21.11111 | 25.00 | 61 | 20.43554 |
| Low | 19 | 19 | 57.50 | 118.0 | 382.05263 | 153.50 | 3724 | 894.03781 |
| Medium | 40 | 3 | 72.75 | 131.0 | 402.80000 | 259.50 | 6297 | 1053.41528 |
| High | 22 | 66 | 121.00 | 155.5 | 364.81818 | 358.50 | 1805 | 476.34298 |
| Highest | 10 | 1 | 3.00 | 53.5 | 175.40000 | 196.75 | 756 | 262.97917 |

Kruskal-Wallis Test for Word Count by Concern_Level

| Statistic | Df | P_value |
|---|---|---|
| 21.393 | 4 | 0.0002646 |

Dunn's Test for Word Count by Concern_Level

| Comparison | Z | P.unadj | P.adj |
|---|---|---|---|
| High - Highest | 2.4292043 | 0.0151320013 | 0.1513200 |
| High - Low | 1.6348864 | 0.1020728561 | 1.0000000 |
| Highest - Low | -1.0608067 | 0.2887777588 | 1.0000000 |
| High - Medium | 1.2417371 | 0.2143335735 | 1.0000000 |
| Highest - Medium | -1.6881877 | 0.0913752034 | 0.9137520 |
| Low - Medium | -0.6547435 | 0.5126328864 | 1.0000000 |
| High - No Concern | 4.2979731 | 0.0000172367 | 0.0001724 |
| Highest - No Concern | 1.6849391 | 0.0920003204 | 0.9200032 |
| Low - No Concern | 2.9373759 | 0.0033100264 | 0.0331003 |
| Medium - No Concern | 3.7162380 | 0.0002022111 | 0.0020221 |
write.csv(df,"100_entries_sample.csv")